Abstract

Latent Dirichlet Allocation (LDA) is a three-level hierarchical
Bayesian model for topic inference. In spite of its great success,
inferring the latent topic distribution with LDA is time-consuming.
Motivated by the transfer learning approach proposed by Hinton et al.
(2015), we present a novel method that uses LDA to supervise the
training of a deep neural network (DNN), so that the DNN can
approximate the costly LDA inference with less computation. Our
experiments on a document classification task show that a simple DNN
can learn the LDA behavior quite well, while inference is sped up by
a factor of tens to hundreds.

1 Introduction 

Probabilistic topic models, for instance Latent Dirichlet Allocation
(LDA) (Blei et al., 2003), have been extensively studied and widely
used in applications such as topic discovery, document classification
and information retrieval. Most of the successful probabilistic topic
models are based on Bayesian networks (Hofmann, 1999; Teh et al.,
2006), where the random variables and the dependencies among them are
carefully designed by hand and so have clear physical and/or
statistical meanings. For this reason, Bayesian topic models can
represent the document generation process well and have attained much
success in semantic analysis and related research.

A particular problem of Bayesian topic models, however, is that when
the model structure is complex, inference of the latent topic
distribution (topic mixture weights) is often intractable. Various
approximation methods have been proposed, such as the variational
approach and the sampling method, but inference remains very slow.


Recently, Hinton et al. (2015) proposed a transfer learning approach.
In this approach, a complex model is used as a teacher model to
supervise the training of a simpler model. The original proposal used
a complex deep neural network (DNN) to train a simple shallow neural
network and obtained performance very close to that of the complex
DNN. This motivated our current research, which attempts to use a
Bayesian model to supervise the training of a neural model.
Specifically, we use an LDA model as the teacher to guide the training
of a DNN, so that the DNN can approximate the behavior and performance
of LDA. A big advantage of this transfer learning from LDA to DNN is
that inference with the DNN is much faster than with LDA. This
addresses a major difficulty of LDA on large-scale online tasks.

We tested the proposed method on a document classification task. The
results show that a simple DNN model can approximate LDA quite well
and that inference is sped up by a factor of tens to hundreds.
Interestingly, a preliminary analysis shows that, through transfer
learning, the DNN model seems able to discover topics similar to those
learned by LDA, although this information is not explicitly presented
in the transfer learning.

2 Related work 

This work develops a neural model to approximate the function of LDA
(Blei et al., 2003), with the direct goal of fast inference. Compared
to early probabilistic models such as pLSI (Hofmann, 1999), LDA treats
the topic mixture as a latent variable rather than a deterministic
parameter. This leads to a full generative model that can deal with
new documents, but also requires much more computation in model
inference. The DNN-based LDA approximation presented in this paper
attempts to solve this problem.

Our work is also closely related to the deep learning research that
was largely initiated by Hinton et al. (2006). The DNN is a popular
deep learning model and is capable of learning complex functions and
inferring layer-wise patterns. This work leverages these advantages
and uses DNNs to approximate LDA. Note that deep learning has been
employed in topic modeling, e.g., the approach based on deep Boltzmann
machines (DBM) (Hinton and Salakhutdinov, 2009; Srivastava et al.,
2013). Our work differs in that we focus on approximating a
well-trained Bayesian model using a deep neural model, instead of
learning the deep model from scratch.

Finally, this research is directly motivated by 
the dark knowledge distiller model (Hinton et al., 
2015) that employs the knowledge learned by a 
complex DNN to guide the training of a simpler 
DNN, or vice versa (Wang et al., 2015). In this 
work, we extend this method to learn a neural 
model with the supervision of a Bayesian model, 
which is more ambitious and challenging. 

3 Methods 

For a particular document d, LDA takes the term frequency (TF) as the
input, denoted by v(d). The inference task is then to derive the topic
mixture θ(d), which is the posterior probability distribution of the
document over the topics. In tasks such as document clustering or
classification, θ(d) is a good representation for document d, with a
low dimensionality and a clear semantic interpretation.

Exact inference with LDA is intractable, so various approximation
methods are usually used. This work chooses the variational inference
method proposed by Blei et al. (2003), which involves iterative
updates of the document and word topic mixtures and is hence
time-consuming. The basic idea of LDA-to-DNN knowledge transfer is to
train a DNN model that can simulate the behavior of LDA inference, but
with much less computation. More precisely, the DNN model learns a
mapping function f(v(d); w) such that f(v(d); w) approaches θ(d),
where w denotes the parameters of the DNN. Note that θ(d) is a
probability distribution. To approximate such normalized variables, a
softmax function is applied to the DNN output and the cross entropy is
used as the training criterion, given by:

\[
\mathcal{L}(w) = -\sum_{d} \sum_{i=1}^{K} \theta(d)_i \, \log f(v(d); w)_i \tag{1}
\]


where K denotes the number of topics and the subscript i indexes the
dimension. Once the DNN is trained, the mapping function f(v(d); w)
has learned the behavior of the LDA model and can be used to predict
θ(d) for new documents. Compared to LDA inference, f(v(d); w) can be
computed very fast and is hence amenable to large-scale online tasks.
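As a rough illustration of the training criterion in Eq. (1), the following NumPy sketch computes the softmax output and the cross entropy against LDA targets. The function names, the `eps` smoothing constant and the toy numbers are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(theta, logits, eps=1e-12):
    """Eq. (1): -sum_d sum_i theta(d)_i * log f(v(d); w)_i.

    theta:  (D, K) LDA topic mixtures used as teacher targets.
    logits: (D, K) pre-softmax DNN outputs (hypothetical name).
    eps:    small constant to avoid log(0); an assumption of this sketch.
    """
    f = softmax(logits)
    return float(-(theta * np.log(f + eps)).sum())

# Toy example: 2 documents, K = 3 topics (made-up numbers).
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
good_logits = np.log(theta)        # softmax(log theta) reproduces theta
bad_logits = np.zeros_like(theta)  # uniform prediction over topics
loss_good = cross_entropy(theta, good_logits)
loss_bad = cross_entropy(theta, bad_logits)
```

In this toy case the loss is lower when the prediction matches the teacher distribution than when it is uniform, which is the behavior the criterion is meant to encourage.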

We experimented with two DNN structures: a 2-layer DNN (DNN-2L) that
involves one hidden layer, and a 3-layer DNN (DNN-3L) that involves
two hidden layers. In DNN-2L, the number of hidden units is twice the
number of output units; in DNN-3L, the first and second hidden layers
have three times and two times as many units as the output layer,
respectively. The hyperbolic function is used as the activation
function. The training employs the stochastic gradient descent (SGD)
method, and is implemented based on Theano (Bastien et al., 2012).¹
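The two structures can be sketched in NumPy as follows. This is only an illustration of the layer widths described above, not the Theano implementation: the use of tanh as the hyperbolic activation, the weight initialization, and all function names are assumptions. The lexicon size 2388 is the Reuters value from Table 1, and 50 topics is an arbitrary choice within the range studied.

```python
import numpy as np

def layer_sizes(vocab_size, num_topics, depth):
    """Layer widths for the two structures described in the text."""
    if depth == 2:    # DNN-2L: one hidden layer with 2K units
        hidden = [2 * num_topics]
    elif depth == 3:  # DNN-3L: hidden layers with 3K and 2K units
        hidden = [3 * num_topics, 2 * num_topics]
    else:
        raise ValueError("only DNN-2L and DNN-3L are described")
    return [vocab_size] + hidden + [num_topics]

def init_params(sizes, seed=0):
    """Small random weights and zero biases (initialization is an assumption)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, v):
    """Map TF vectors v of shape (D, V) to predicted topic mixtures (D, K)."""
    h = v
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)  # assumed hyperbolic (tanh) activation
    W, b = params[-1]
    z = h @ W + b
    z = z - z.max(axis=1, keepdims=True)  # stable softmax on the output
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

sizes_3l = layer_sizes(vocab_size=2388, num_topics=50, depth=3)
params = init_params(sizes_3l)
# Four synthetic TF vectors standing in for real documents.
v = np.random.default_rng(1).poisson(0.1, size=(4, 2388)).astype(float)
pred = forward(params, v)
```

Each row of `pred` is a normalized distribution over the K topics, so it can be compared directly with the LDA topic mixture.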

Note that we have assumed that the topics have been learned already.
In fact, learning topics is even slower than inferring the topic
mixtures. For example, the empirical Bayesian method proposed by Blei
et al. (2003) involves an alternating variational EM procedure, which
is rather slow. However, since the model training can be conducted
off-line, it is not a big concern for online tasks.

4 Experiments 

4.1 Database and experimental setup 

The proposed method is tested on the document classification task with
two datasets. The first dataset is Reuters-21578, and we follow the
‘LEWISSPLIT’ configuration to define the training and test data. The
documents are labelled in 55 classes.²

The second dataset is 20 Newsgroups collected by Ken Lang, which
contains about 20,000 articles evenly distributed over 20 UseNet
discussion groups. These groups correspond to the classes in document
classification.³

LDA is known to perform better on long documents (Tang et al., 2014).
To establish a strong LDA baseline, only long documents are selected
for training and testing in this study. Considering that 20 Newsgroups
is much larger than Reuters-21578, different selection criteria are
used to choose documents for the two datasets, as shown in Table 1.
The table also shows the lexicon size in the LDA and DNN modeling,
which corresponds to the dimensionality of the TF feature. Note that
this seemingly tricky data selection is just for building a strong LDA
model for the DNN to learn from, rather than intentionally selecting a
favorable working scenario for the proposed method. In fact, the DNN
learning works well with any LDA teacher model, and the performance of
the resultant DNN largely depends on the quality of the teacher LDA.

¹ http://deeplearning.net/software/theano/
² https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
³ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html



                             Reuters   20 News
Document length threshold        100       300
Training documents              3622      6312
Test documents                  1705      1542
Word frequency threshold          30       200
Lexicon size (words)            2388      1910

Table 1: Data profile of the experimental datasets. 


4.2 Results 

To evaluate the proposed transfer learning, we compare the
classification performance obtained with the document vectors inferred
by the LDA-supervised DNN and by the original LDA. A support vector
machine (SVM) with a linear kernel is used as the classifier. Since
LDA is the teacher model, its performance can be regarded as an upper
bound of the DNN learning. Additionally, we choose the popular
principal component analysis (PCA) (Jolliffe, 2002) as another
baseline and regard it as a lower bound of the learning. All three
methods generate low-dimensional document vectors and are comparable
in the sense of dimension reduction. Note that in many cases LDA does
not outperform PCA, though this is not the focus of our study. What we
are concerned with is that, in cases where LDA is superior to PCA, the
learned DNN can retain this superiority at a much lower computational
cost.
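For the PCA baseline, one possible way to obtain K-dimensional document vectors from TF features is a projection onto the leading principal directions. The sketch below uses an SVD of the centered training matrix; the text does not say how PCA was computed, so this choice, the function name and the synthetic data are assumptions. The classifier itself is not shown.

```python
import numpy as np

def pca_project(train, test, k):
    """Project TF vectors onto the top-k principal directions of the training set."""
    mean = train.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:k]
    return (train - mean) @ components.T, (test - mean) @ components.T

# Synthetic TF matrices standing in for real documents (made-up sizes).
rng = np.random.default_rng(0)
train_tf = rng.poisson(1.0, size=(20, 6)).astype(float)
test_tf = rng.poisson(1.0, size=(5, 6)).astype(float)
train_vec, test_vec = pca_project(train_tf, test_tf, k=3)
```

The test documents are centered with the training mean, so that both sets are mapped into the same low-dimensional space before classification.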

4.2.1 Document classification 

The classification accuracy on the two datasets is reported in
Figure 1, where the number of topics varies from 10 to 70. We first
observe that LDA obtains better performance than PCA on both datasets.
Again, this is partly attributed to the long documents used in the
study. The two DNN models obtain performance similar to LDA and
outperform PCA, particularly with a small number of topics. This
indicates that the DNNs indeed learned the behavior of LDA. If the
number of topics is large, the DNN models do not work as well,
possibly because the limited amount of training data (just several
thousand training samples) cannot support learning complex models.

Figure 1: The classification accuracy of PCA, LDA, 2-layer DNN
(DNN-2L) and 3-layer DNN (DNN-3L).

Note that the 3-layer DNN outperforms the 2-layer DNN. This indicates
that deeper models can learn the LDA behavior more precisely. This can
be evaluated more directly in terms of the KL divergence between the
LDA output θ(d) and the DNN prediction f(v(d); w), as shown in
Figure 2.
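A minimal sketch of such a KL measurement is given below. The direction of the divergence is not stated in the text; this example assumes KL(θ(d) ‖ f(v(d); w)), which differs from the cross entropy in Eq. (1) only by the entropy of θ(d). The function name, the clipping constant and the toy numbers are likewise assumptions.

```python
import numpy as np

def mean_kl(theta, pred, eps=1e-12):
    """Average over documents of KL(theta(d) || f(v(d); w)).

    theta: (D, K) LDA topic mixtures.
    pred:  (D, K) DNN predictions.
    eps:   clipping constant to avoid log(0) and division by zero.
    """
    theta = np.clip(theta, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return float((theta * np.log(theta / pred)).sum(axis=1).mean())

# Toy example: one document, two topics (made-up numbers).
theta_ex = np.array([[0.5, 0.5]])
pred_ex = np.array([[0.25, 0.75]])
kl_ex = mean_kl(theta_ex, pred_ex)
kl_same = mean_kl(theta_ex, theta_ex)  # identical distributions
```

The divergence is zero when the prediction equals the LDA output and grows as the two distributions move apart.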

4.2.2 Inference speed 

The comparative results on inference time are shown in Figure 3. The
experiments were conducted on a desktop with four 3.4 GHz cores; to
reduce the effect of randomness, each experiment was run 10 times and
the averaged numbers are reported. It can be seen that the DNN model
is much faster (10 to 200 times) than the original LDA, and the
advantage is clearer with a large number of topics. Comparing the
results on the two datasets, we observe that the DNN shows a larger
advantage on 20 Newsgroups, because the long documents of this dataset
are more difficult to infer with LDA. Additionally, the 3-layer DNN is
not much slower than the 2-layer DNN, which means that using deeper
models does not cause much additional computation.

Figure 2: The averaged KL divergence between the DNN and LDA output,
calculated on the test data of Reuters-21578 and 20 Newsgroups.

Figure 3: The ratio of inference time of LDA to DNN.
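A time ratio of the kind plotted in Figure 3 could be measured with a small harness like the one below. This is a sketch only: the helper names are hypothetical, and the two `time.sleep` calls are stand-ins for the actual LDA and DNN inference routines, which are not reproduced here.

```python
import time

def mean_runtime(fn, repeats=10):
    """Average wall-clock time of fn() over `repeats` runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / repeats

def speedup(slow_fn, fast_fn, repeats=10):
    """Ratio of the mean runtime of slow_fn to that of fast_fn."""
    return mean_runtime(slow_fn, repeats) / mean_runtime(fast_fn, repeats)

# Stand-ins for LDA inference (slow) and DNN inference (fast).
ratio = speedup(lambda: time.sleep(0.05), lambda: time.sleep(0.005), repeats=3)
```

Averaging over repeated runs, as done in the experiments, reduces the influence of timing noise on the reported ratio.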

5 Topic discovery by transfer learning 

A known advantage of DNNs is that high-level representations can be
learned automatically layer by layer. This property may help the DNN
to discover topics from the raw TF input. To verify this conjecture, a
one-hot vector is given to the DNN input, and the activation of each
hidden neuron is recorded. The one-hot vector represents a particular
word, and the activation reflects how strongly a particular neuron is
related to this word. For each neuron, we record the activations for
all the words and select the top-10 words that give the most
significant activations, which form the set of representative words
for the neuron.

Figure 4: Discovery of the topic ‘mining’ with the DNN. The words in
dark are topic-related words.
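The probing procedure described above can be sketched as follows. The code feeds every one-hot word vector through the network up to a chosen hidden layer and ranks words by activation for each neuron. Treating "most significant" as "largest activation value", using tanh as the activation, and the toy vocabulary and weights are all assumptions of this illustration.

```python
import numpy as np

def hidden_activations(params, v, layer):
    """Activations of hidden layer `layer` (1-based) for inputs v of shape (D, V)."""
    h = v
    for W, b in params[:layer]:
        h = np.tanh(h @ W + b)  # assumed hyperbolic (tanh) activation
    return h

def representative_words(params, vocab, layer, top_n=10):
    """For each neuron in the given hidden layer, list its top_n words."""
    one_hot = np.eye(len(vocab))             # row j is the one-hot vector of word j
    act = hidden_activations(params, one_hot, layer)  # shape (V, H)
    top = np.argsort(-act, axis=0)[:top_n]   # word indices, largest activation first
    return [[vocab[j] for j in top[:, n]] for n in range(act.shape[1])]

# Toy network: 4 words, one hidden layer with 2 neurons (made-up weights).
vocab = ["gold", "copper", "bank", "loan"]
W1 = np.array([[2.0, 0.0],
               [1.0, 0.0],
               [0.0, 3.0],
               [0.0, 1.0]])
toy_params = [(W1, np.zeros(2))]
topics = representative_words(toy_params, vocab, layer=1, top_n=2)
```

In the toy example the first neuron responds most strongly to "gold" and "copper" and the second to "bank" and "loan", so each neuron's word list plays the role of a local topic.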

Interestingly, we find that for each neuron, the representative words
are generally correlated, forming a local topic. Figure 4 shows an
example, where the topic ‘mining’ at the second hidden layer is formed
by aggregating the related topics at the first hidden layer. This
example shows clearly how words are clustered layer by layer to form
semantically meaningful topics. We also find that the topics derived
from the DNN and from LDA are quite similar, and the DNN-derived
topics look more reasonable. As an example, the top-10 words for the
topic ‘mining’ derived from LDA are {gold, said, mine, copper, ounces,
mining, tons, ton, silver, reuter}, while the DNN-derived top-10 words
are {gold, copper, mine, mining, silver, zinc, minerals, metal, mines,
ton}.

6 Conclusion and future work 

We proposed a knowledge transfer learning method that uses deep neural
networks to approximate LDA. Results on a document classification task
show that a simple DNN can approximate LDA quite well, while the
inference is tens or hundreds of times faster. This preliminary
research indicates that transferring knowledge from Bayesian models to
neural models is possible.

Future work involves studying knowledge transfer between more complex
probabilistic models and other neural models. In particular, we are
interested in how to use the knowledge of probabilistic models to
regularize neural models so that the neurons are more interpretable.